AITopics | frame caption

frame caption

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

381ceeae4a1feb1abc59c773f7e61839-Supplemental-Conference.pdf

Neural Information Processing Systems | Aug-14-2025, 06:07:42 GMT

caption, frame caption, video caption, (13 more...)

Country:

North America > United States > Illinois (0.04)
Asia > China > Hong Kong (0.04)

Industry: Leisure & Entertainment > Games (0.95)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

381ceeae4a1feb1abc59c773f7e61839-Paper-Conference.pdf

Neural Information Processing Systems | Aug-14-2025, 06:07:38 GMT

language model, proceedings, video, (10 more...)

Country:

Asia > South Korea > Seoul > Seoul (0.04)
North America > Dominican Republic (0.04)
Asia > China > Hong Kong (0.04)
(18 more...)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos

Phukan, Arpan, Gupta, Manish, Ekbal, Asif

arXiv.org Artificial Intelligence | Oct-13-2024

Previous studies on question generation from videos have mostly focused on generating questions about common objects and attributes and hence are not entity-centric. In this work, we focus on the generation of entity-centric information-seeking questions from videos. Such a system could be useful for video-based learning, recommending "People Also Ask" questions, video-based chatbots, and fact-checking. Our work addresses three key challenges: identifying question-worthy information, linking it to entities, and effectively utilizing multimodal signals. Further, to the best of our knowledge, no large-scale dataset exists for this task. Most video question generation datasets cover TV shows, movies, or human activities, or lack entity-centric information-seeking questions. Hence, we contribute a diverse dataset of YouTube videos, VideoQuestions, consisting of 411 videos with 2265 manually annotated questions. We further propose a model architecture combining Transformers, rich context signals (titles, transcripts, captions, embeddings), and a combination of cross-entropy and contrastive loss functions to encourage entity-centric question generation. Our best method yields BLEU, ROUGE, CIDEr, and METEOR scores of 71.3, 78.6, 7.31, and 81.9, respectively, demonstrating practical usability. We make the code and dataset publicly available at https://github.com/thePhukan/ECIS-VQG
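The abstract mentions combining a cross-entropy loss with a contrastive loss but does not say how. The sketch below shows one common way such a combination could look: a mean token-level cross-entropy plus a weighted InfoNCE-style contrastive term. The function names, the cosine-similarity formulation, the `weight` and `temperature` parameters, and the choice of positives and negatives are all assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE-style loss: pull the anchor toward the positive,
    push it away from the negatives (hypothetical formulation)."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    scaled = [s / temperature for s in sims]
    return cross_entropy(scaled, 0)  # the positive sits at index 0


def combined_loss(
    token_logits: List[Sequence[float]],
    token_targets: List[int],
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    weight: float = 0.5,
    temperature: float = 0.1,
) -> float:
    """Mean token cross-entropy plus a weighted contrastive term.
    `weight` is an assumed hyperparameter, not a value from the paper."""
    ce = sum(
        cross_entropy(l, t) for l, t in zip(token_logits, token_targets)
    ) / len(token_targets)
    cl = contrastive_loss(anchor, positive, negatives, temperature)
    return ce + weight * cl
```

In this sketch the anchor might be an embedding of the generated question, the positive an entity-centric reference question, and the negatives generic non-entity questions. That pairing is a guess at how a contrastive term could encourage entity-centric output.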

large language model, machine learning, question answering, (21 more...)

arXiv: 2410.09776

Country:

North America > United States (0.28)
Europe (0.14)
Asia > Nepal (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Education (1.00)
(3 more...)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

Wang, Zhenhailong, Li, Manling, Xu, Ruochen, Zhou, Luowei, Lei, Jie, Lin, Xudong, Wang, Shuohang, Yang, Ziyi, Zhu, Chenguang, Hoiem, Derek, Chang, Shih-Fu, Bansal, Mohit, Ji, Heng

arXiv.org Artificial Intelligence | Oct-13-2022

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and lack the ability to generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without pretraining or finetuning on any video datasets. We use image-language models to translate the video content into frame captions and object, attribute, and event phrases, and compose them into a temporal structure template. We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content. The flexibility of prompting allows the model to capture any form of text input, such as automatic speech recognition (ASR) transcripts. Our experiments demonstrate the power of language models in understanding videos on a wide variety of video-language tasks, including video captioning, video question answering, video caption retrieval, and video future event prediction. In particular, on video future event prediction, our few-shot model significantly outperforms state-of-the-art supervised models trained on large-scale video datasets. Code and resources are publicly available for research purposes at https://github.com/MikeWangWZHL/VidIL.
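The prompting step described in the abstract (frame captions and phrases composed into a temporal template, preceded by a few in-context examples) can be sketched roughly as follows. This is a minimal illustration, not the VidIL implementation: the `VideoDescription` class, the field labels ("Objects:", "Subtitles:", "Output:"), and the "First / Then / Finally" markers are assumptions. The abstract does not give the actual template wording.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VideoDescription:
    """Text extracted from one video (hypothetical container)."""
    frame_captions: List[str]                  # captions in temporal order
    objects: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)
    events: List[str] = field(default_factory=list)
    asr: Optional[str] = None                  # optional ASR transcript
    target: Optional[str] = None               # known output, for examples only


def temporal_marker(index: int, total: int) -> str:
    """Pick an ordering word for a frame caption (assumed wording)."""
    if index == 0:
        return "First"
    if index == total - 1:
        return "Finally"
    return "Then"


def render_video(video: VideoDescription) -> str:
    """Lay out one video's text signals as a block of labeled lines."""
    lines = []
    if video.objects:
        lines.append("Objects: " + ", ".join(video.objects))
    if video.attributes:
        lines.append("Attributes: " + ", ".join(video.attributes))
    if video.events:
        lines.append("Events: " + ", ".join(video.events))
    if video.asr:
        lines.append("Subtitles: " + video.asr)
    total = len(video.frame_captions)
    ordered = [
        f"{temporal_marker(i, total)}, {caption}."
        for i, caption in enumerate(video.frame_captions)
    ]
    lines.append("Frame captions: " + " ".join(ordered))
    # In-context examples carry their target; the query is left open.
    lines.append("Output:" + (f" {video.target}" if video.target else ""))
    return "\n".join(lines)


def compose_prompt(
    instruction: str,
    examples: List[VideoDescription],
    query: VideoDescription,
) -> str:
    """Instruction, then the few-shot examples, then the open query."""
    blocks = [instruction]
    blocks += [render_video(example) for example in examples]
    blocks.append(render_video(query))
    return "\n\n".join(blocks)
```

The resulting string would be passed to a language model of the reader's choice, which completes the final "Output:" line. Because every signal is plain text, adding another input such as an ASR transcript only requires one more labeled line.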

caption, large language model, question answering, (17 more...)

arXiv.org Artificial Intelligence

2205.10747

Country:

Asia > South Korea > Seoul > Seoul (0.04)
North America > Dominican Republic (0.04)
Asia > China > Hong Kong (0.04)
(18 more...)

Genre: Research Report (0.82)

Industry:

Media (1.00)
Leisure & Entertainment > Games (0.93)
Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.55)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.54)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
